MatchDetectReveal: finding overlapping and similar digital documents

Authors

  • Krisztián Monostori
  • Arkady B. Zaslavsky
  • Heinz W. Schmidt
Abstract

The Internet provides easy access to large collections of semi-structured digital documents. WWW browsers, search engines and the "cut & paste" technique make it tempting to replace one's own creativity with simple compilation from suitable digital resources. This paper discusses the problems of detecting plagiarism in large collections of semi-structured electronic texts. The project focuses on overlaps in, and similarity of, digital documents and software code. The conceptual architecture of the MatchDetectReveal system is presented along with possible applications. The main component of the system uses string matching algorithms and a suffix tree representation. Both sequential and parallel cluster-based processing issues are addressed. Implementation and performance issues are also discussed.
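
To make the idea of suffix-based overlap detection concrete, here is a minimal sketch. It is not the MatchDetectReveal implementation: it swaps in an uncompressed suffix trie (a simpler, quadratic-space relative of the suffix tree) built from nested dictionaries, and the names build_suffix_trie, find_overlaps and min_len are hypothetical, chosen only for illustration.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie of `text` as nested dicts.

    Every substring of `text` is a path from the root. This uses
    O(n^2) space, so it is only suitable for short inputs; a real
    suffix tree compresses these paths.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def find_overlaps(source, suspect, min_len=8):
    """Return (position_in_suspect, matched_text) for each maximal run
    of `suspect` that also occurs somewhere in `source`.

    `min_len` is an assumed threshold that filters out short,
    accidental matches.
    """
    trie = build_suffix_trie(source)
    overlaps = []
    prev_end = 0  # furthest end reached by any earlier match
    for start in range(len(suspect)):
        node, end = trie, start
        # Walk the trie as far as the suspect text keeps matching.
        while end < len(suspect) and suspect[end] in node:
            node = node[suspect[end]]
            end += 1
        # Skip matches that are too short or lie inside an earlier one.
        if end - start >= min_len and end > prev_end:
            overlaps.append((start, suspect[start:end]))
        prev_end = max(prev_end, end)
    return overlaps


source = "the quick brown fox jumps over the lazy dog"
suspect = "a quick brown fox walked over the lazy cat"
print(find_overlaps(source, suspect))
# [(1, ' quick brown fox '), (24, ' over the lazy ')]
```

Matching here is character by character, so reported runs can begin or end in the middle of a word. A compressed suffix tree answers the same queries in linear space, which matters for collections of the size the abstract describes.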


Similar resources

Using the MatchDetectReveal System for Comparative Analysis of Texts

In this paper we introduce the MatchDetectReveal system, which identifies similarity between documents. Different applications of the system are discussed, including cross-referencing multiple editions of literary works, plagiarism detection, organizing collections of documents and comparative analysis of texts. The system uses suffix trees and suffix vectors for compari...


Analyzing registry, log files, and prefetch files in finding digital evidence in graphic design applications

The products of graphic design applications leave behind traces of digital information which can be used during a digital forensic investigation in cases where counterfeit documents have been created. This paper analyzes the digital forensics involved in the creation of counterfeit documents. This is achieved by first recognizing the digital forensic artifacts left behind from the use of graphi...


Key management in digital rights management systems in offline mode

With the expanding use of digital content in information technology, supervision and control over data, as well as preventing the copying of documents, have come under consideration. In this regard, digital rights management systems are responsible for the secure distribution of digital content, and for this purpose the common functions in the field of cryptography and utilize Digital watermarkin...


Anti-Serendipity: Finding Useless Documents and Similar Documents

The problem of finding your way through a relatively unknown collection of digital documents can be daunting. Such collections sometimes have few categories and little hierarchy, or they have so much hierarchy that valuable relations between documents can easily become obscured. We describe here how our work in the area of term recognition and sentence-based summarization can be used to filter t...


Parallel and Distributed Overlap Detection on the Web

Proliferation of digital libraries plus availability of electronic documents from the Internet have created new challenges for computer science researchers and professionals. Documents are easily copied and redistributed or used to create plagiarised assignments and conference papers. This paper presents a new, two-stage approach for identifying overlapping documents. The first stage is identif...




Publication date: 2000